суббота, 7 января 2023 г.

How to quickly find anomalies in number series using the Hampel method

Translated from Russian. The original article is here:

In practice, there are issues for the solution of which it is required to find anomalies in the numerical series. For ease of understanding, we can assume that these are values that differ from most numbers in the series in some way (outlier, non-standard value, deviation from the norm). Such tasks are found in various areas:

  • cleaning of noisy data in Data Science;
  • outlier filtering in the training sample for neural networks in Machine Learning;
  • search for abnormal network hacker activity, while monitoring traffic and events in Cybersecurity;
  • detection of outliers or tails in the stock data stream in Algorithmic Trading;
  • as well as in any anomaly search tasks, where data can be presented as a numerical series.

The concepts of a number series in mathematical analysis and in statistics are different. We accept a numerical series as its statistical understanding, that is, a finite sequence of numbers (analogous to a sample). There are various interpretations of the anomaly in the numerical series. We will consider them further.

The article also shows examples of how to find anomalies quickly and efficiently in numerical series using the modified Hampel method (Hampel F.R.).

The concept of the number series anomaly

For one-dimensional data, outlier estimation is extensively researched in the statistical literature. Traditionally, the most useful estimates for characterizing the variance of data are the sample mean and variance of the sample. They give a good estimate of the location and scatter of the data in the sample, but only if the outliers "do not pollute" it. Even if the data contains only one observation that is significantly different from the rest in the sample, then the sample mean can also deviate significantly from the sample mean without this outlier.

To measure the robustness of a chosen statistical estimation method against outliers, Hampel introduced the concept of a breakdown point. The breakdown point is the largest percentage of "polluted data" (the proportion of incorrect observations or outliers) that the chosen method can process before it starts to produce an incorrect result. These can be arbitrarily large aberrant (abnormal, erroneous) values.

Intuitively, one can understand that the breakdown point cannot exceed 50%, because if more than half of the observations are "polluted", then it is impossible to distinguish the main distribution of the series from the polluting distribution. Therefore, the maximum outlier fraction for the breakdown point is 0.5. There are examples of statistics that reach the maximum value for the breakdown point. For example, it is 0.5 for Median. And for the sample mean, it is equal to 1/n, where n is the sample size.

It is generally believed that the higher the breakdown point value, the more reliable the statistical evaluation method. A statistic with a high breakdown point value is sometimes called a robust (resistant) statistic.

To estimate more reliably the location and scatter of sample elements, it is often recommended to combine the Median and the Median Absolute Deviation (MAD). It is they that underlie the method of filtering the sample for outliers according to Hampel [1].

The MAD value is calculated like this:

MAD (X) = Median ( | x1 − Median (X) | , …, | xn − Median (X) | ),                                                ( 1 )

where X is a sample of n observations x1, …, xn.

According to Hampel, a sample outlier is a number x from a sample X, the modulus of the difference of which and the sample median is greater than the MAD of the sample, calculated by formula (1) and multiplied by a special coefficient k (scale factor), depending on the distribution (≈1.4826 for normal) [2].

In practice, to build a Hampel outlier filter, this definition is modified using sliding windows and the parameter s (sigma), a threshold number of standard deviations. This means that the median and MAD are calculated for all sliding windows consisting of the elements of the sample. Further, all modules of the difference between the elements of the sample and the median are compared with MAD, multiplied by the coefficient k and multiplied by the parameter s [3]. A higher standard deviation threshold s makes the filter more sparing; a lower threshold identifies more numbers as outliers.

Now let's define the basic concept: what we mean by an anomaly in a finite number series for practical issues.

Definition 1. The anomaly of a number series is a number x from the series X, which modulo differs from the median of the sliding window by more than s times from the MAD of this window, multiplied by the coefficient k.

To generalize the anomaly filter as much as possible and abstract it from the practical issue being solved, we propose the following set-theoretic interpretation.

All anomalies of the series form some of its subset A:

A = { a | | a − Median (Xi) | > s ∙ k ∙ MAD (Xi},                                                                              ( 2 )

where a is an anomalous element from X calculated according to Definition 1, X is the initial number series, Xi  is the number series of the i-th sliding window, total windows: (n − w + 1), where w is the window size.

The anomalies filter of the number series is defined as a function:

F: X → { True, False }.  F (xi) = True, if xi  A; False, if xi  A.                                                       ( 3 )

Here X is the original number series consisting of n elements xi, A is the set of anomalies (2).

Filtering the number series for anomalies by the Hampel method

To filter a number series for the presence of anomalies in it, it is proposed to use the Python implementation of the HampelFilter() function. This function allows you to detect anomaly among the values of any number series using the Hampel filtering method. The filter detects anomalies determined according to (2). Configurable parameters in this filter function: window (variable w from the formulas above) — size of the sliding window (5 by default), sigma (variable s) — threshold number of standard deviations (3 by default), scaleFactor (variable k) — special coefficient, depending on the distribution of the series (1.4826 by default).

The HampelFilter() filtering function returns a new series F, according to (3) consisting of True or False values, where True means the presence of an anomalous element at the corresponding position in the original series.

Input examples and results, HampelFilter(window=5, sigma=3, scaleFactor=1.4826)

Input data                 Function output

[10, 10, 10, 10, 10]       [False, False, False, False, False]
[1, 10, 10, 10, 10]        [True, False, False, False, False]
[1, 5, 10, 10, 10]         [True, True, False, False, False]
[1, 5, 1, 1, 1]            [False, True, False, False, False]

Input examples and results, HampelFilter(window=3, sigma=3, scaleFactor=1.4826)

Input data                 Function output

[1, 5, 1, 1, 1]            [False, True, False, False, False]
[1, 10, 10, 1, 10, 1]      [True, False, False, False, False, False]
[1, 10, 10, 10, 10, 1]     [True, False, False, False, False, True]
[1, 1, 1, 10, 10, 10]      [False, False, False, False, False, False]

Finding the index of the first anomalous element

In practice, it is often necessary to determine only the first anomalous or the first among the largest elements in the number series. To do this, you can use the Python implementation of the HampelAnomalyDetection() function. This function returns the minimum index of an element in the list of found anomalies, or the index of the first maximum element in the input series if its index is less than the index of the anomalous element. If there are no anomalies in the number series, or if all elements are equal (no maximums), then None is returned.

Input examples and results, HampelAnomalyDetection() with default values

Input data                 Function output

[1, 1, 1, 1, 111, 1]       4
[1, 1, 10, 1, 1, 1]        2
[111, 1, 1, 1, 1, 1]       0
[111, 1, 1, 1, 1, 111]     0
[1, 11, 1, 111, 1, 1]      1
[1, 1, 1, 111, 99, 11]     3
[-111, 1, 1, 1, 1]         0
[1, 2, 1, -1, 1]           1
[1]                        None
[1, 2]                     None
[1, 1, 1, 1, 1, 1]         None

Filtering stock prices for anomalies

You can see the use of the above functions using the example of the task of finding outliers in stock data. To do this, we generated time series with data similar to random stock prices and anomalies in them using the PriceGenerator library, and then applied these functions to the data. The functions themselves are in the TKSBrokerAPI library (it contains TradeRoutines module).

The solution to the issue was implemented on Jupyter Notebook: HampelFilteringExample.ipynb. After analysis by the Hampel method anomalies on the price chart can be marked, for example, as shown in the illustration below.

As can be seen with the naked eye, there are clearly outliers in the series, tails at the top and bottom of some candles. We are not interested in all such outliers, but only in tails that are too long. For us, these are signs of a price anomaly. The Hampel filter coped with the detection of anomalies in a series of 75 candles with a sliding window value equal to the series size, the number of standard deviations equal to 3 (sigma parameter) and a constant coefficient characterizing the normal distribution equal to 1.4826 (scale factor parameter). Similarly, you can analyze for anomalies the size of the bodies of the candles.

Conclusions

Of course, Hampel's method also has its drawbacks. For example, when implementing the algorithm, one may encounter the fact that the anomalies present in the first and last elements of the series according to the original algorithm will be ignored. This is due to the use of a sliding window. To workaround the issue, we had to pre-expand the number series in front and behind by the size of this window. Only after that it was possible to detect anomalies both in the first and in the last elements of the initial series.

However, in practice, the Hampel filter is extremely effective. Thanks to modern libraries such as Pandas, it handles fairly large numerical series of tens of thousands elements in a matter of seconds, and is also easy to optimize and parallelize (using CUDA Python and Numba’s @cuda.jit decorator, or Python Multiprocessing ThreadPool).

Thus, if you have an issue of finding anomalies in a number series, try looking towards filtering them using the Hampel method. It gives fast results and is also easy to understand and implement.


Authors: Timur Gilmullin, Ph.D., Mansur Gilmullin, Ph.D.

Sources:

  1. Hampel F. R. The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69, 382–393, 1974.
  2. Hancong Liu, Sirish Shah and Wei Jiang. On-line outlier detection and data cleaning. Computers and Chemical Engineering. Vol. 28, March 2004, pp. 1635–1647.
    Link: https://sites.ualberta.ca/~slshah/files/on_line_outlier_det.pdf
  3. Lewinson Eryk. Outlier Detection with Hampel Filter. September 26, 2019.
    Link: https://towardsdatascience.com/outlier-detection-with-hampel-filter-85ddf523c73d

Source code of the modules used in examples: